Terminology Evolution Module for Web Archives in the LiWA Context∗

Authors

  • Nina Tahmasebi
  • Gideon Zenz
  • Tereza Iofciu
  • Thomas Risse
Abstract

More and more national libraries and institutes are archiving the web as part of the cultural heritage. As with all long-term archives, these archives contain text and language that evolve over time. This is particularly true for web archives, as content published online is highly dynamic and changes at a fast rate. Language evolution causes gaps between the terminology used for querying and the terminology stored in long-term archives. To ensure access to and interpretability of these archives, language evolution must be detected and handled automatically. In this paper we present the LiWA terminology evolution module, TeVo, which takes us one step closer to fully automatic detection of terminology evolution. TeVo consists of a pipeline, based on the UIMA framework, for finding evolution in web archives. The LiWA TeVo module has two main processing chains: the first for WARC file extraction and text processing, and the second for finding terminology evolution. We also present the terminology evolution browser, the TeVo browser, which aids in exploring the evolution of terms present in archives.


Similar resources

Turning pure Web Page Storages into Living Web Archives

Web content plays an increasingly important role in the knowledge-based society, and the preservation and long-term accessibility of Web history has high value (e.g., for scholarly studies, market analyses, intellectual property disputes, etc.). There is strongly growing interest in its preservation by libraries and archival organizations as well as emerging industrial services. Web content cha...


Web Spam: a Survey with Vision for the Archivist

While Web archive quality is endangered by Web spam, a side effect of the high commercial value of top-ranked search-engine results, so far Web spam filtering technologies are rarely used by Web archivists. In this paper we make the first attempt to disseminate existing methodology and envision a solution for Web archives to share knowledge and unite efforts in Web spam hunting. We survey the st...


Terminology Evolution in Web Archiving: Open Issues

The correspondence between the terminology used for querying and the one used in the content objects to be retrieved is a crucial prerequisite for effective retrieval technology. However, as terminology evolves over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systemati...


First Results on Detecting Term Evolutions∗

The archival of content like publications or web pages is just the first step toward “full” content preservation. It also has to be guaranteed that content can be found and interpreted in the long run. The correspondence between the terminology used for querying and the one used in the content objects to be retrieved is a crucial prerequisite for effective retrieval technology. However, a...


Temporal search in web archives

Web archives include both archives of contents originally published on the Web (e.g., the Internet Archive) and archives of contents published long ago that are now accessible on the Web (e.g., the archive of The Times). Thanks to the increased awareness that web-born contents are worth preserving and to improved digitization techniques, web archives have grown in number and size. To unfol...




Publication date: 2010